55 research outputs found
Influence-Optimistic Local Values for Multiagent Planning --- Extended Version
Recent years have seen the development of methods for multiagent planning
under uncertainty that scale to tens or even hundreds of agents. However, most
of these methods either make restrictive assumptions on the problem domain, or
provide approximate solutions without any guarantees on quality. Methods in the
former category typically build on heuristic search using upper bounds on the
value function. Unfortunately, no techniques exist to compute such upper bounds
for problems with non-factored value functions. To allow for meaningful
benchmarking through measurable quality guarantees on a very general class of
problems, this paper introduces a family of influence-optimistic upper bounds
for factored decentralized partially observable Markov decision processes
(Dec-POMDPs) that do not have factored value functions. Intuitively, we derive
bounds on very large multiagent planning problems by subdividing them in
sub-problems, and at each of these sub-problems making optimistic assumptions
with respect to the influence that will be exerted by the rest of the system.
We numerically compare the different upper bounds and demonstrate how we can
achieve a non-trivial guarantee that a heuristic solution for problems with
hundreds of agents is close to optimal. Furthermore, we provide evidence that
the upper bounds may improve the effectiveness of heuristic influence search,
and discuss further potential applications to multiagent planning.Comment: Long version of IJCAI 2015 paper (and extended abstract at AAMAS
2015
The Role of Diverse Replay for Generalisation in Reinforcement Learning
In reinforcement learning (RL), key components of many algorithms are the
exploration strategy and replay buffer. These strategies regulate what
environment data is collected and trained on and have been extensively studied
in the RL literature. In this paper, we investigate the impact of these
components in the context of generalisation in multi-task RL. We investigate
the hypothesis that collecting and training on more diverse data from the
training environment will improve zero-shot generalisation to new
environments/tasks. We motivate mathematically and show empirically that
generalisation to states that are "reachable" during training is improved by
increasing the diversity of transitions in the replay buffer. Furthermore, we
show empirically that this same strategy also shows improvement for
generalisation to similar but "unreachable" states and could be due to improved
generalisation of latent representations.Comment: 14 pages, 8 figure
E-MCTS: Deep Exploration in Model-Based Reinforcement Learning by Planning with Epistemic Uncertainty
One of the most well-studied and highly performing planning approaches used
in Model-Based Reinforcement Learning (MBRL) is Monte-Carlo Tree Search (MCTS).
Key challenges of MCTS-based MBRL methods remain dedicated deep exploration and
reliability in the face of the unknown, and both challenges can be alleviated
through principled epistemic uncertainty estimation in the predictions of MCTS.
We present two main contributions: First, we develop methodology to propagate
epistemic uncertainty in MCTS, enabling agents to estimate the epistemic
uncertainty in their predictions. Second, we utilize the propagated uncertainty
for a novel deep exploration algorithm by explicitly planning to explore. We
incorporate our approach into variations of MCTS-based MBRL approaches with
learned and provided dynamics models, and empirically show deep exploration
through successful epistemic uncertainty estimation achieved by our approach.
We compare to a non-planning-based deep-exploration baseline, and demonstrate
that planning with epistemic MCTS significantly outperforms non-planning based
exploration in the investigated deep exploration benchmark.Comment: Submitted to NeurIPS 2023, accepted to EWRL 202
Optimal and Approximate Q-value Functions for Decentralized POMDPs
Decision-theoretic planning is a popular approach to sequential decision
making problems, because it treats uncertainty in sensing and acting in a
principled way. In single-agent frameworks like MDPs and POMDPs, planning can
be carried out by resorting to Q-value functions: an optimal Q-value function
Q* is computed in a recursive manner by dynamic programming, and then an
optimal policy is extracted from Q*. In this paper we study whether similar
Q-value functions can be defined for decentralized POMDP models (Dec-POMDPs),
and how policies can be extracted from such value functions. We define two
forms of the optimal Q-value function for Dec-POMDPs: one that gives a
normative description as the Q-value function of an optimal pure joint policy
and another one that is sequentially rational and thus gives a recipe for
computation. This computation, however, is infeasible for all but the smallest
problems. Therefore, we analyze various approximate Q-value functions that
allow for efficient computation. We describe how they relate, and we prove that
they all provide an upper bound to the optimal Q-value function Q*. Finally,
unifying some previous approaches for solving Dec-POMDPs, we describe a family
of algorithms for extracting policies from such Q-value functions, and perform
an experimental evaluation on existing test problems, including a new
firefighting benchmark problem
Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL
Reinforcement learning agents may sometimes develop habits that are effective
only when specific policies are followed. After an initial exploration phase in
which agents try out different actions, they eventually converge toward a
particular policy. When this occurs, the distribution of state-action
trajectories becomes narrower, and agents start experiencing the same
transitions again and again. At this point, spurious correlations may arise.
Agents may then pick up on these correlations and learn state representations
that do not generalize beyond the agent's trajectory distribution. In this
paper, we provide a mathematical characterization of this phenomenon, which we
refer to as policy confounding, and show, through a series of examples, when
and how it occurs in practice
Reinforcement Learning by Guided Safe Exploration
Safety is critical to broadening the application of reinforcement learning
(RL). Often, we train RL agents in a controlled environment, such as a
laboratory, before deploying them in the real world. However, the real-world
target task might be unknown prior to deployment. Reward-free RL trains an
agent without the reward to adapt quickly once the reward is revealed. We
consider the constrained reward-free setting, where an agent (the guide) learns
to explore safely without the reward signal. This agent is trained in a
controlled environment, which allows unsafe interactions and still provides the
safety signal. After the target task is revealed, safety violations are not
allowed anymore. Thus, the guide is leveraged to compose a safe behaviour
policy. Drawing from transfer learning, we also regularize a target policy (the
student) towards the guide while the student is unreliable and gradually
eliminate the influence of the guide as training progresses. The empirical
analysis shows that this method can achieve safe transfer learning and helps
the student solve the target task faster.Comment: Accecpted at ECAI 202
Decentralized multi-robot cooperation with auctioned pomdps
ABSTRACT Planning under uncertainty faces a scalability problem when considering multi-robot teams, as the information space scales exponentially with the number of robots. To address this issue, this paper proposes to decentralize multi-agent Partially Observable Markov Decision Process (POMDPs) while maintaining cooperation between robots by using POMDP policy auctions. Auctions provide a flexible way of coordinating individual policies modeled by POMDPs and have low communication requirements. Additionally, communication models in the multi-agent POMDP literature severely mismatch with real inter-robot communication. We address this issue by applying a decentralized data fusion method in order to efficiently maintain a joint belief state among the robots. The paper focuses on a cooperative tracking application, in which several robots have to jointly track a moving target of interest. The proposed ideas are illustrated in real multi-robot experiments, showcasing the flexible and robust coordination that our techniques can provide
Bounded approximations for linear multi-objective planning under uncertainty.
Abstract Planning under uncertainty poses a complex problem in which multiple objectives often need to be balanced. When dealing with multiple objectives, it is often assumed that the relative importance of the objectives is known a priori. However, in practice human decision makers often find it hard to specify such preferences exactly, and would prefer a decision support system that presents a range of possible alternatives. We propose two algorithms for computing these alternatives for the case of linearly weighted objectives. First, we propose an anytime method, approximate optimistic linear support (AOLS), that incrementally builds up a complete set of -optimal plans, exploiting the piecewise-linear and convex shape of the value function. Second, we propose an approximate anytime method, scalarised sample incremental improvement (SSII), that employs weight sampling to focus on the most interesting regions in weight space, as suggested by a prior over preferences. We show empirically that our methods are able to produce (near-)optimal alternative sets orders of magnitude faster than existing techniques, thereby demonstrating that our methods provide sensible approximations in stochastic multi-objective domains
- …